Skip to content

Fix time source mismatch in client RPC path after #3268 migration#3283

Merged
chenBright merged 1 commit intoapache:masterfrom
hjwsm1989:fix-la-time-source-mismatch
Apr 26, 2026
Merged

Fix time source mismatch in client RPC path after #3268 migration#3283
chenBright merged 1 commit intoapache:masterfrom
hjwsm1989:fix-la-time-source-mismatch

Conversation

@hjwsm1989
Copy link
Copy Markdown
Contributor

@hjwsm1989 hjwsm1989 commented Apr 25, 2026

已按 @chenBright#issuecomment-4319830604 的建议改成"回退 LocalityAwareLoadBalancer 这一侧的改动",比迁移整条调用链更局部、更保守。

这个 PR 解决什么问题

#3268("Use monotonic time instead of wall time")把
LocalityAwareLoadBalancer::Weight::Updateend_time_us
LocalityAwareLoadBalancer::Describenow 切到了
butil::cpuwide_time_us(),但所有产生 CallInfo::begin_time_us 的调用
方还在用 butil::gettimeofday_us()

  • Channel::CallMethodchannel.cpp:451)→ Controller::IssueRPC
    Controller::Call::begin_time_usSelectIn::begin_time_us
    CallInfo::begin_time_us
  • Controller::OnVersionedRPCReturned 中的 backup-request 与普通重试入口
    controller.cpp:672, 715)都是 IssueRPC(butil::gettimeofday_us())

时钟域错配导致:

const int64_t end_time_us = butil::cpuwide_time_us();         // 单调时钟,约 5e6
const int64_t latency = end_time_us - ci.begin_time_us;       // - 墙上时钟(约 1.7e15)
                                                              // ≈ -1.7e15 us
if (latency <= 0) {
    // time skews, ignore the sample.
    return 0;
}

每个 sample 都被 time-skew 短路丢掉。_time_q 永远拿不到新条目,
_avg_latency 一直是 0,locality-aware 反馈机制实际被静默关掉。

下游可观测的现象:冷启动场景下,list:// channel + lb=la + 2 个
backend、max_retry=1,第一个 backend 不通时,retry 偶发返回
EHOSTDOWNFail to select server from list://...),即使第二个
backend 是健康的。

Bisect / 复现

我在内部 fork 上对 brpc 1.16.1 与 1.17.0-rc2 之间的 51 个 commit 做了
bisect(master 代码不变,只换 brpc 版本),每一步在 2-backend
list:// + lb=la + max_retry=1 下跑探测 RPC 重复 500 次:

步骤 commit 失败
good c41e838a Release 1.16.1 0/500
... (4 个 GOOD 步骤) ...
#47 771de31e Fix UAF in LatencyRecorder 0/500
#48 12fb539a Use monotonic time (#3268) 25/500
#49 b223e60b Fix RDMA resource 30/500
bad 604dad0c 1.17.0-rc2 28/500

12fb539a 上做单行隔离实验:只把

 int64_t LocalityAwareLoadBalancer::Weight::Update(const CallInfo& ci, ...) {
-    const int64_t end_time_us = butil::cpuwide_time_us();
+    const int64_t end_time_us = butil::gettimeofday_us();

撤回(保留 #3268 其余所有改动),测试就回到 0/500 PASS。这把回归精确钉
到了那一行 end_time_us 切换。

改动内容

@chenBright 的建议,把 LocalityAwareLoadBalancer 一侧从 #3268 引入
的 cpuwide 改动回退,让它继续使用 gettimeofday_us(),与现有调用方对
齐:

文件 行号 改动
src/brpc/policy/locality_aware_load_balancer.cpp 21 include 注释回退到 gettimeofday_us
src/brpc/policy/locality_aware_load_balancer.cpp 379 Weight::Update::end_time_us 回退
src/brpc/policy/locality_aware_load_balancer.cpp 527 Describe::now 回退

Channel::CallMethodController::OnVersionedRPCReturned 中那些
butil::gettimeofday_us() 调用都不用动;Controller::_begin_time_us
Controller::latency_us() 等公开 API 的墙上时钟语义保持不变。

测试

新增 test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source
不依赖 Server / Channel 的 LB 单元测试。它驱动若干次
SelectServer + Feedback 循环,begin_time_us
butil::gettimeofday_us()(与 Channel::CallMethod 当前行为一致),并
断言 _avg_latency 反映了真实的过期时间,而不是被 time-skew 短路卡在
0。

如果将来有人再把 Weight::Update::end_time_us 切走,或者某个
CallInfo::begin_time_us 的调用方漂到了别的时钟域,这个测试都会失败。

不起 Server / 不依赖网络,CI 上是确定性结果(本地 ~16ms)。
la_sanityla_selection_too_long 也都仍然通过。

Check List

  • 没有公开 API 变更。
  • 新增测试 la_records_latency_with_consistent_time_source
  • 向后兼容:Controller::_begin_time_us / latency_us() 保持墙上
    时钟语义不变。

@hjwsm1989
Copy link
Copy Markdown
Contributor Author

hjwsm1989 commented Apr 25, 2026

Hi @chenBright — could you take a look when you have time? This PR completes the time-source migration started in #3268; the bisect in the description pinpoints LocalityAwareLoadBalancer::Weight::Update's end_time_us (switched to cpuwide_time_us() in #3268) being inconsistent with the wall-clock CallInfo::begin_time_us plumbed in from Channel::CallMethod and the retry sites in Controller::OnVersionedRPCReturned. Thanks!

@chenBright
Copy link
Copy Markdown
Contributor

I think it would be more appropriate to roll back the changes to LocalityAwareLoadBalancer.

@hjwsm1989
Copy link
Copy Markdown
Contributor Author

I think it would be more appropriate to roll back the changes to LocalityAwareLoadBalancer.

ok

PR apache#3268 ("Use monotonic time instead of wall time") switched
LocalityAwareLoadBalancer::Weight::Update's end_time_us and
LocalityAwareLoadBalancer::Describe's now to butil::cpuwide_time_us(),
but every caller that supplies CallInfo::begin_time_us still uses
butil::gettimeofday_us():

  - Channel::CallMethod (channel.cpp:451) -> Controller::IssueRPC ->
    Controller::Call::begin_time_us -> SelectIn::begin_time_us ->
    CallInfo::begin_time_us
  - Controller::OnVersionedRPCReturned retry sites
    (controller.cpp:672, 715) call IssueRPC(gettimeofday_us()) on
    backup-request and regular retries

The mismatched time domains make

    latency = end_time_us - ci.begin_time_us
            = cpuwide_now - wallclock_begin
            ~= -1.7e15 us

trigger the `if (latency <= 0) { /* time skews, ignore */ return 0; }`
short-circuit on every call. _time_q never accumulates samples,
_avg_latency stays at 0, and locality-aware weight feedback is silently
disabled.

Visible downstream symptom: cold-start `list://` channels with `lb=la`
and 2 backends occasionally fail RPCs with EHOSTDOWN ("Fail to select
server from list://...") on retry even when one backend is healthy.
Bisected reproduction in xsky/brpc fork:

  - 51 commit range c41e838..604dad0c (1.16.1 .. 1.17.0-rc2)
  - master code + LA-driven multipath probe at 2 backends, max_retry=1,
    repeat 500x:
      * commit 771de31 (one before apache#3268): 0/500 fail
      * commit 12fb539 (apache#3268):           25/500 fail
      * commit 12fb539 + revert only Weight::Update::end_time_us to
        gettimeofday_us:                    0/500 fail

This commit reverts the LA-side of apache#3268's clock change so the LB lines
up with its existing callers again. Channel::CallMethod and the retry
paths in Controller stay on butil::gettimeofday_us(), which preserves
the wall-clock semantics of Controller::_begin_time_us /
Controller::latency_us() that public users rely on.

Adds test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source
which drives a series of SelectServer + Feedback cycles against
LocalityAwareLoadBalancer (no Server / Channel needed) and asserts
that _avg_latency reflects the elapsed time, rather than being stuck
at 0 because of a time-source mismatch.
@hjwsm1989 hjwsm1989 force-pushed the fix-la-time-source-mismatch branch from 9a6dda3 to 2c285ff Compare April 25, 2026 14:47
@hjwsm1989
Copy link
Copy Markdown
Contributor Author

@chenBright 谢谢 review,已按你的建议改成回退 LocalityAwareLoadBalancer 一侧的 #3268 改动(Weight::Update::end_time_usDescribe::now 与 include 注释),不动 Channel::CallMethod / Controller::OnVersionedRPCReturned 的调用方,这样 Controller::_begin_time_us / latency_us() 的墙上时钟语义保持原样。

新增的 la_records_latency_with_consistent_time_source 测试也对应调整:直接用 butil::gettimeofday_us() 作为 CallInfo::begin_time_us(匹配现有 caller),断言 _avg_latency > 0 —— 如果将来再有人把 Weight::Update 的时钟切走,或者哪个 caller 漂到别的时钟域,都会被这个测试抓到。

force-pushed 到 2c285ff2,PR description 也同步更新到了新的 revert 思路下。再麻烦你看一下,谢谢!

@wwbmmm
Copy link
Copy Markdown
Contributor

wwbmmm commented Apr 26, 2026

LGTM

Copy link
Copy Markdown
Contributor

@chenBright chenBright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chenBright chenBright merged commit 6ddefce into apache:master Apr 26, 2026
31 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants